When fitting regression models with multiple explanatory variables, each explanatory variable is interpreted in association with the other variables in the model. For example, if we wanted to model income then we may consider an individual’s level of education, and perhaps the wealth of their parents. Then, when interpreting the effect an individual’s level of education has on their income, we would also be considering the effect of the wealth of their parents simultaneously, as these two variables are likely to be related.
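The regression model we will be considering contains the following variables: the outcome variable \(y\), an individual’s credit card Balance, and two explanatory variables \(x_1\) and \(x_2\), their credit Limit and Income, respectively.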
What is the mean credit Limit?
What is the median credit Balance?
What percentage of credit card holders have an income greater than $57,470?
What is the correlation coefficient for the linear relationship between Balance and Limit?
What would be the verbal interpretation of the correlation coefficient for the linear relationship between Balance and Income?
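The questions above can be approached with a few base R functions. The following is a minimal sketch, not a reference solution: `credit_summaries` and its argument `income_cutoff` are hypothetical names introduced here, and `data` is assumed to have numeric columns `Limit`, `Balance` and `Income`.

```r
# Hypothetical helper: summary statistics for a data frame that is assumed
# to contain the numeric columns Limit, Balance and Income.
credit_summaries <- function(data, income_cutoff) {
  list(
    mean_limit         = mean(data$Limit),
    median_balance     = median(data$Balance),
    # mean() of a logical vector is the proportion of TRUE values
    pct_income_above   = 100 * mean(data$Income > income_cutoff),
    cor_balance_limit  = cor(data$Balance, data$Limit),
    cor_balance_income = cor(data$Balance, data$Income)
  )
}
```

It could be called as `credit_summaries(credit, income_cutoff = ...)`, with the cutoff expressed in the same units as the `Income` column.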
Collinearity (or multicollinearity) occurs when an explanatory variable within a multiple regression model can be linearly predicted from the other explanatory variables with a high level of accuracy. For example, in this case, since Limit and Income are highly correlated, we could take a good guess as to an individual’s Income based on their Limit. That is, having two or more highly correlated explanatory variables within a multiple regression model essentially provides us with redundant information. Normally, we would remove one of the highly correlated variables, but for the purpose of this example we will ignore the potential issue.
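One way to quantify how well one explanatory variable can be linearly predicted from the other is to regress Income on Limit and look at the resulting \(R^2\). The sketch below assumes a data frame with numeric columns `Limit` and `Income`; the function name `collinearity_check` is hypothetical.

```r
# Hypothetical helper: how strongly are the two explanatory variables
# linearly related? `data` is assumed to contain Limit and Income.
collinearity_check <- function(data) {
  # Pearson correlation between the two explanatory variables
  r <- cor(data$Limit, data$Income)
  # Proportion of the variation in Income explained linearly by Limit
  r_squared <- summary(lm(Income ~ Limit, data = data))$r.squared
  c(correlation = r, r_squared = r_squared)
}
```

With only two explanatory variables the \(R^2\) is simply the squared correlation, so values of either close to 1 indicate that the second variable adds little information beyond the first.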
```r
library(ggplot2)
library(gridExtra)  # for grid.arrange()

# Balance against credit limit, with a fitted least-squares line
p1 <- ggplot(credit, aes(x = Limit, y = Balance)) +
  geom_point() +
  labs(x = "Credit limit [$]",
       y = "Credit card balance [$]",
       title = "Relationship between balance and credit limit") +
  geom_smooth(method = "lm", se = FALSE)

# Balance against income, with a fitted least-squares line
p2 <- ggplot(credit, aes(x = Income, y = Balance)) +
  geom_point() +
  labs(x = "Income [$1000s]",
       y = "Credit card balance [$]",
       title = "Relationship between balance and income") +
  geom_smooth(method = "lm", se = FALSE)

# Place the two plots side by side
grid.arrange(p1, p2, ncol = 2)
```
Relationship between balance and explanatory variables: credit limit and income.
What is the relationship between balance and credit limit?
What is the relationship between balance and income?
The two scatterplots above focus on the relationship between the outcome variable Balance and each of the explanatory variables independently. In order to get an idea of the relationship between all three variables we can use the plot_ly function within the plotly package to plot a 3D scatterplot.
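A minimal sketch of such a call is given below, assuming a data frame with columns `Income`, `Limit` and `Balance`. The wrapper name `plot_balance_3d` is hypothetical, and the choice of which variable goes on which axis is made here for illustration and is not necessarily the one used for the figure.

```r
library(plotly)

# Hypothetical wrapper: 3D scatterplot of Balance against Income and Limit.
# `data` is assumed to contain the columns Income, Limit and Balance.
plot_balance_3d <- function(data) {
  plot_ly(data,
          x = ~Income, y = ~Limit, z = ~Balance,
          type = "scatter3d", mode = "markers")
}
```

Calling `plot_balance_3d(credit)` would return an interactive plot that can be rotated to inspect the point cloud from different angles.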
3D scatterplot between balance and explanatory variables: credit limit and income.
The multiple regression model we will be fitting to the credit balance data is given as:
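\[y_i = \alpha + \beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i, ~~~ \epsilon_i \sim N(0, \sigma^2),\]
where \(y_i\) is the balance of the \(i\)th individual, \(\alpha\) is the intercept, \(\beta_1\) and \(\beta_2\) are the coefficients for the explanatory variables \(x_1\) (Limit) and \(x_2\) (Income), and \(\epsilon_i\) is the \(i\)th random error component. The fitted model gives the following estimates: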
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -385.179 | 19.465 | -19.789 | 0 | -423.446 | -346.912 |
| Limit | 0.264 | 0.006 | 44.955 | 0 | 0.253 | 0.276 |
| Income | -7.663 | 0.385 | -19.901 | 0 | -8.420 | -6.906 |
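A table of this form can be reproduced with base R. The sketch below is one way to do so, assuming a data frame with columns `Balance`, `Limit` and `Income`; the function name `fit_balance_model` is hypothetical, and the output columns are named to mirror the table above.

```r
# Hypothetical helper: fit Balance on Limit and Income by least squares and
# collect estimates, standard errors, t statistics, p-values and 95%
# confidence intervals in one data frame.
fit_balance_model <- function(data) {
  model <- lm(Balance ~ Limit + Income, data = data)
  coefs <- summary(model)$coefficients
  ci <- confint(model, level = 0.95)
  data.frame(
    term      = rownames(coefs),
    estimate  = coefs[, "Estimate"],
    std_error = coefs[, "Std. Error"],
    statistic = coefs[, "t value"],
    p_value   = coefs[, "Pr(>|t|)"],
    lower_ci  = ci[, 1],
    upper_ci  = ci[, 2],
    row.names = NULL
  )
}
```

Each estimate is interpreted with the other explanatory variable held fixed: for example, the Limit estimate of 0.264 means that, for a given income, a $1 increase in credit limit is associated with an average increase of about $0.26 in balance.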
Simpson’s Paradox: From the scatterplots above we see positive relationships between credit card balance and both credit limit and income. Why do we then get a negative coefficient for income (\(\widehat{\beta}_{income} = -7.66\))? This is due to a phenomenon known as Simpson’s Paradox. This occurs when there are trends within different categories (or groups) of data, but these trends disappear or reverse when the categories are grouped as a whole.
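One way to see this is to compare the slope of Balance on Income across all individuals with the slopes within groups of individuals who have similar credit limits. The sketch below assumes a data frame with columns `Balance`, `Limit` and `Income`. Grouping by quartiles of Limit is a choice made here for illustration, and the function name `income_slopes_by_limit` is hypothetical.

```r
# Hypothetical helper: slope of Balance on Income overall and within
# brackets of Limit (brackets are the quartiles of Limit, an assumed grouping).
income_slopes_by_limit <- function(data) {
  bracket <- cut(data$Limit,
                 breaks = quantile(data$Limit, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE)
  # Slope from a simple linear regression of Balance on Income
  slope <- function(d) unname(coef(lm(Balance ~ Income, data = d))["Income"])
  c(overall = slope(data),
    sapply(split(data, bracket), slope))
}
```

If the overall slope is positive while the slopes within the brackets are close to zero or negative, the positive trend in the scatterplot is largely driven by the association between Income and Limit rather than by Income itself.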
Now we need to assess our model assumptions.
First, we need to obtain the fitted values and residuals from our regression model:
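```r
# balance.model is the fitted regression model; get_regression_points() is from moderndive
regression.points <- get_regression_points(balance.model)
```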
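Recall that get_regression_points provides us with values of the outcome variable \(y\) (Balance), the explanatory variables \(x_1\) (Limit) and \(x_2\) (Income), the fitted values \(\widehat{y}\), and the residuals \(y - \widehat{y}\).
We can assess our first two model assumptions by producing scatterplots of our residuals against each of our explanatory variables.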
```r
# Residuals against credit limit, with a reference line at zero
# (linewidth requires ggplot2 >= 3.4.0; older versions use size instead)
p3 <- ggplot(regression.points, aes(x = Limit, y = residual)) +
  geom_point() +
  labs(x = "Credit limit [$]",
       y = "Residual",
       title = "Residuals vs. credit limit") +
  geom_hline(yintercept = 0, col = "blue", linewidth = 1)

# Residuals against income, with a reference line at zero
p4 <- ggplot(regression.points, aes(x = Income, y = residual)) +
  geom_point() +
  labs(x = "Income [$1000s]",
       y = "Residual",
       title = "Residuals vs. income") +
  geom_hline(yintercept = 0, col = "blue", linewidth = 1)

# Stack the two residual plots vertically
grid.arrange(p3, p4, nrow = 2)
```
Residual plots of credit limit and income.
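The fitted values and residuals plotted above can also be computed directly in base R, which makes explicit what a residual is. The following is a sketch under the assumption that the data frame has columns `Balance`, `Limit` and `Income`. The function name `residual_table` is hypothetical, and the output only approximates the layout returned by get_regression_points.

```r
# Hypothetical helper: observed values, fitted values and residuals from the
# regression of Balance on Limit and Income.
residual_table <- function(data) {
  model <- lm(Balance ~ Limit + Income, data = data)
  out <- data[, c("Balance", "Limit", "Income")]
  out$Balance_hat <- as.numeric(fitted(model))    # fitted values
  out$residual <- out$Balance - out$Balance_hat   # observed minus fitted
  out
}
```

Least-squares residuals from a model with an intercept always average to zero overall and are uncorrelated with each explanatory variable by construction. The residual plots are therefore read for local patterns, such as curvature or a spread that changes along the x-axis, rather than for the overall mean.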